df <- read.csv("C:/Users/Bodoque/Desktop/housing_data_CDMX_v2.csv")
head(df)
## property_type places lat.lon price currency
## 1 apartment MiguelHidalgo 23.634501,-102.552788 5500000 MXN
## 2 house Iztapalapa 19.31033,-99.068557 1512000 MXN
## 3 apartment Tlalpan 19.279771,-99.234597 926667 MXN
## 4 apartment MiguelHidalgo 23.634501,-102.552788 6410000 MXN
## 5 apartment MiguelHidalgo 19.432657,-99.177444 4416000 MXN
## 6 apartment BenitoJuarez 19.367025,-99.170349 3150000 MXN
## price_aprox_local_currency price_aprox_usd surface_total_in_m2
## 1 5450246 289775.66 54
## 2 1498322 79661.96 80
## 3 918284 48822.82 100
## 4 6352013 337720.36 135
## 5 4376052 232663.51 87
## 6 3122244 166001.78 100
## surface_covered_in_m2 price_usd_per_m2 price_per_m2 lat lon
## 1 54 5366.2159 101851.85 23.63450 -102.55279
## 2 80 995.7745 18900.00 19.31033 -99.06856
## 3 100 488.2282 9266.67 19.27977 -99.23460
## 4 135 2501.6323 47481.48 23.63450 -102.55279
## 5 87 2674.2932 50758.62 19.43266 -99.17744
## 6 100 1660.0178 31500.00 19.36703 -99.17035
I will delete the columns that are not useful for this analysis:
df <- df %>%
select(-c(lat.lon,currency,price,price_aprox_local_currency, price_per_m2))
summary(df) # Summary statistics
## property_type places price_usd surface_total_in_m2
## Length:18234 Length:18234 Min. : 5797 Min. : 1
## Class :character Class :character 1st Qu.: 59070 1st Qu.: 67
## Mode :character Mode :character Median : 120652 Median : 95
## Mean : 270398 Mean : 4062
## 3rd Qu.: 250484 3rd Qu.: 195
## Max. :17890000 Max. :65748000
## surface_covered_in_m2 price_usd_per_m2 lat lon
## Min. : 1 Min. :1.000e-03 Min. :19.19 Min. :-102.55
## 1st Qu.: 67 1st Qu.:6.774e+02 1st Qu.:19.35 1st Qu.: -99.20
## Median : 95 Median :1.108e+03 Median :19.39 Median : -99.16
## Mean : 4040 Mean :1.504e+03 Mean :19.46 Mean : -99.22
## 3rd Qu.: 200 3rd Qu.:1.898e+03 3rd Qu.:19.43 3rd Qu.: -99.14
## Max. :65748000 Max. :1.759e+05 Max. :41.58 Max. : -90.49
total_rows <- nrow(df)
total_cols <- ncol(df)
cat("Total rows: ",total_rows)
## Total rows: 18234
cat("\nTotal columns: ",total_cols)
##
## Total columns: 8
The dataset contains 8 columns:
property_type : Type of property (e.g., house,
apartment)places : Borough of Mexico City where the property is
locatedprice_usd : Property price in dollars.surface_total_in_m2 : Total surface area in m2.surface_covered_in_m2 : Built surface in m2.price_usd_per_m2 : Property price per square meter in
USD.lat : Latitude coordinate of the property
location.lon : Longitude coordinate of the property
location.
The histogram above shows the distribution of property prices in USD on a logarithmic scale. Most properties are concentrated between 50,000 and 300,000 USD, with a peak around 100,000–200,000 USD. A smaller number of high-value properties extend the distribution to over 1 million USD, which creates a long right tail. This indicates that the market is dominated by mid-priced properties, while luxury listings are rare but present.
The histogram above shows the distribution of total surface area (in square meters) using a logarithmic scale. Most properties cluster between 70 and 200 m2, with a strong peak around 100 m2, which reflects the typical size of apartments or small houses in Mexico City. A small number of records extend far beyond this range, representing outliers or possibly data entry errors with unrealistically large values.
The boxplot displays the distribution of total surface area (m2) on a logarithmic scale. Most properties are concentrated between 70 and 200 m2, with a median close to 100 m2. However, there are numerous outliers with much larger areas, indicating the presence of unusually large properties or data entry errors.
As we can see, all categories show extreme outliers, reflecting unusually large properties or potential data inconsistencies. Next, take a look at the distribution of covered surface area, it shows a pattern similar to what we observed earlier with the total surface area.
The histogram shows that most properties are concentrated between 1,000 and 5,000 USD per m2, with a clear peak around 2,000 USD per m2. The boxplot confirms this distribution, with the majority of values clustered in a narrow range but with numerous outliers on both ends. This indicates that while the market has a typical mid-range price, there are also extreme cases that deviate significantly from the norm.
There are some values that do not make sense. For example, the column surface_total_in_m2 shows a minimum value of 1, and the column surface_covered_in_m2 also has 1. This is unrealistic because the area would be too small. I am going to explore the data further to determine whether these values need to be deleted.
minimal_area_m2 <- df %>%
filter(surface_total_in_m2 < 70)
nrow(minimal_area_m2)
## [1] 4919
There are 4,919 records with a total surface area below 70 m2. These records will be removed to ensure the dataset contains only realistic values.
df <- df[df$surface_total_in_m2 >= 70, ] # Filter rows where surface_total_in_m2 >= 70. The empty part after the comma means keep all columns.
total_rows <- nrow(df)
cat("Total rows after filtering for surfaces greater than 70 m2: ",total_rows)
## Total rows after filtering for surfaces greater than 70 m2: 13315
We also observed outliers in the upper part of the box plot. This indicates that some properties have unusually large areas. Therefore, I am going to remove those records by applying the interquartile range (IQR) method.
cat("Rows before:", nrow(df), "\n")
## Rows before: 13315
cat("Rows after removing outliers:", nrow(df_no_outliers), "\n")
## Rows after removing outliers: 12187
cat("Removed:", nrow(df) - nrow(df_no_outliers), "rows\n")
## Removed: 1128 rows
Next, we will check for duplicates and null values.
df<-df_no_outliers
df <- df %>% distinct()
cat("Duplicate rows deleted: ", total_rows - nrow(df), "\n")
## Duplicate rows deleted: 2911
cat("Null values: ",sum(is.na(df)))
## Null values: 0
total_rows <- nrow(df)
cat("Total rows: ",total_rows)
## Total rows: 10404
summary(df$surface_total_in_m2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 70 85 120 162 211 492
We need these results to match the covered area, so I will apply a filter that gathers the same information.
df <- df[df$surface_covered_in_m2 >= 70, ]
df <- df[df$surface_covered_in_m2 <= 492, ]
total_rows <- nrow(df)
cat("Total rows: ",total_rows)
## Total rows: 10277
summary(df)
## property_type places price_usd surface_total_in_m2
## Length:9698 Length:9698 Min. : 5797 Min. : 70.0
## Class :character Class :character 1st Qu.: 79567 1st Qu.: 85.0
## Mode :character Mode :character Median : 151210 Median :120.0
## Mean : 261281 Mean :158.9
## 3rd Qu.: 274035 3rd Qu.:208.0
## Max. :12314238 Max. :492.0
## surface_covered_in_m2 price_usd_per_m2 lat lon
## Min. : 70.0 Min. : 14.82 Min. :19.19 Min. :-102.55
## 1st Qu.: 85.0 1st Qu.: 686.35 1st Qu.:19.35 1st Qu.: -99.20
## Median :118.0 Median : 1136.10 Median :19.39 Median : -99.17
## Mean :156.6 Mean : 1556.94 Mean :19.45 Mean : -99.22
## 3rd Qu.:200.0 3rd Qu.: 2025.50 3rd Qu.:19.42 3rd Qu.: -99.14
## Max. :492.0 Max. :175917.69 Max. :23.63 Max. : -90.49
After removing the outliers in surface area, the price histogram looks more reasonable. The area outliers were probably distorting the price distribution, and once removed, the shape now clearly shows where most properties in Mexico City are concentrated. You can see the difference in the histogram below.
However, I can see a maximum value of 12,314,238 USD. That price could be real, but I doubt it, so I will review the record to see whether the surface area coincides or if it is an incorrectly recorded price.
high_price_df <- df %>%
filter(price_usd > 12000000)
nrow(high_price_df)
## [1] 1
head(high_price_df)
## property_type places price_usd surface_total_in_m2
## 1 apartment BenitoJuarez 12314238 70
## surface_covered_in_m2 price_usd_per_m2 lat lon
## 1 70 175917.7 19.36045 -99.15689
It is an apartment with only 70 m2 of covered and total surface area, which is impossible, so I will drop it.
df <- df[df$price_usd <= 12000000, ]
df %>%
arrange(price_usd) %>%
head(10)
## property_type places price_usd surface_total_in_m2
## 1 store AlvaroObregon 5796.88 251
## 2 apartment MiguelHidalgo 7000.00 350
## 3 apartment Cuajimalpa 7114.36 480
## 4 apartment Tlahuac 8693.26 86
## 5 house Xochimilco 15805.94 230
## 6 house Xochimilco 15809.69 200
## 7 apartment AlvaroObregon 15893.77 80
## 8 house Coyoacan 17386.53 70
## 9 house Coyoacan 19005.85 70
## 10 house Coyoacan 19005.85 80
## surface_covered_in_m2 price_usd_per_m2 lat lon
## 1 251 23.09514 19.34652 -99.19129
## 2 350 20.00000 19.42985 -99.18195
## 3 480 14.82158 19.35735 -99.29979
## 4 86 101.08442 19.28339 -99.05504
## 5 230 68.72148 19.24157 -99.11103
## 6 200 79.04845 19.24102 -99.10752
## 7 80 198.67212 19.39081 -99.19541
## 8 70 248.37900 19.31224 -99.10609
## 9 70 271.51214 19.35021 -99.16215
## 10 80 237.57312 19.32377 -99.11656
There are 4 values that are far from the others in terms of price. For the sake of simplifying the data, I will drop them.
df <- df %>%
mutate(
price_usd_per_m2 = price_usd / surface_total_in_m2,
covered_ratio = surface_covered_in_m2 / surface_total_in_m2)
stopifnot(all(df$surface_total_in_m2 >= df$surface_covered_in_m2, na.rm = TRUE))
summary(df[, c("price_usd","price_usd_per_m2","surface_total_in_m2","surface_covered_in_m2","covered_ratio")])
## price_usd price_usd_per_m2 surface_total_in_m2 surface_covered_in_m2
## Min. : 15806 Min. : 68.69 Min. : 70.0 Min. : 70.0
## 1st Qu.: 79575 1st Qu.: 686.68 1st Qu.: 85.0 1st Qu.: 85.0
## Median : 151210 Median : 1136.25 Median :120.0 Median :118.0
## Mean : 260142 Mean : 1539.58 Mean :158.9 Mean :156.6
## 3rd Qu.: 274035 3rd Qu.: 2025.50 3rd Qu.:208.0 3rd Qu.:200.0
## Max. :8429837 Max. :31876.47 Max. :492.0 Max. :492.0
## covered_ratio
## Min. :0.2575
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.9918
## 3rd Qu.:1.0000
## Max. :1.0000
In the picture above, we can see some points where the price is very high while the total surface area is low. This may be a data entry error or could be explained by the location. In any case, I will analyze these occurrences further.
summary(df[, c("surface_total_in_m2", "price_usd")])
## surface_total_in_m2 price_usd
## Min. : 70.0 Min. : 15806
## 1st Qu.: 85.0 1st Qu.: 79575
## Median :120.0 Median : 151210
## Mean :158.9 Mean : 260142
## 3rd Qu.:208.0 3rd Qu.: 274035
## Max. :492.0 Max. :8429837
df_flags <- df %>%
mutate(
flag_small_highprice = surface_total_in_m2 < 85 & price_usd > 274035,
flag_large_lowprice = surface_total_in_m2 > 208 & price_usd < 79575,
flag_any = flag_small_highprice | flag_large_lowprice
)
cat("Small and very expensive:", sum(df_flags$flag_small_highprice), "\n")
## Small and very expensive: 27
cat("Large and very cheap:", sum(df_flags$flag_large_lowprice), "\n")
## Large and very cheap: 103
cat("Total outliers:", sum(df_flags$flag_any), "\n")
## Total outliers: 130
top_bottom <- df_flags %>%
filter(flag_any) %>%
arrange(desc(price_usd))
bind_rows(
head(top_bottom, 5),
tail(top_bottom, 5)
)
## property_type places price_usd surface_total_in_m2
## 1 apartment Coyoacan 790297.25 70
## 2 house Coyoacan 726546.61 70
## 3 apartment MiguelHidalgo 500000.00 76
## 4 apartment Cuauhtemoc 395945.99 79
## 5 house BenitoJuarez 385138.19 80
## 6 apartment Cuauhtemoc 27923.83 370
## 7 apartment Cuauhtemoc 27676.20 380
## 8 apartment Cuauhtemoc 26994.23 393
## 9 apartment Cuauhtemoc 21795.44 281
## 10 house Xochimilco 15805.94 230
## surface_covered_in_m2 price_usd_per_m2 lat lon covered_ratio
## 1 70 11289.96071 19.31608 -99.12644 1
## 2 70 10379.23729 19.35381 -99.16361 1
## 3 76 6578.94737 19.40172 -99.23482 1
## 4 79 5011.97456 19.43072 -99.15837 1
## 5 80 4814.22738 19.39551 -99.18424 1
## 6 370 75.46981 19.45034 -99.12606 1
## 7 380 72.83211 19.44784 -99.14040 1
## 8 393 68.68761 19.44682 -99.13039 1
## 9 281 77.56384 19.45178 -99.15709 1
## 10 230 68.72148 19.24157 -99.11103 1
## flag_small_highprice flag_large_lowprice flag_any
## 1 TRUE FALSE TRUE
## 2 TRUE FALSE TRUE
## 3 TRUE FALSE TRUE
## 4 TRUE FALSE TRUE
## 5 TRUE FALSE TRUE
## 6 FALSE TRUE TRUE
## 7 FALSE TRUE TRUE
## 8 FALSE TRUE TRUE
## 9 FALSE TRUE TRUE
## 10 FALSE TRUE TRUE
There are 130 records that meet the established conditions. Before deciding to drop them, I will plot them by borough and property type.
The five boroughs with the most values are Cuauhtémoc, Benito Juárez, Gustavo A. Madero, Miguel Hidalgo, and Álvaro Obregón.
Apartments and houses are the ones with the most flagged outliers.
In the plot above, we see that stores and houses almost always have low prices, whereas apartments tend to have higher prices.
In the plot above, we see that stores are the property type with the smallest total surface area, followed by houses, and then apartments. Apartments range from approximately 70 m2 to the maximum recorded value of 492 m2.
df_flags %>%
ggplot(aes(x = flag_any, y = price_usd_per_m2)) +
geom_boxplot(fill = "#D0BBFC", outlier.alpha = 0.3) +
scale_y_log10(labels = scales::comma) +
labs(title = "Comparison of USD/m2 between normal and flagged records",
x = "Flagged as outlier", y = "USD per m2 (log)") +
theme_minimal()
The boxplot compares the distribution of price per m2 between normal records (FALSE) and those flagged as outliers (TRUE).
Normal records show a median around 1,000 USD/m2, with most values between 500 and 2,000 USD/m2, which is consistent with typical market ranges in Mexico City.
Flagged records, however, are shifted downward, with a median closer to 500 USD/m2. They also present extreme high values (>10,000 USD/m2), suggesting inconsistencies such as very cheap large properties or very expensive small ones.
This confirms that the flagged group behaves very differently from the main distribution, supporting the case for excluding them.
df_clean <- df_flags %>%
filter(!flag_any) %>%
select(-starts_with("flag_"))
cat("Rows before:", nrow(df_flags), "\n")
## Rows before: 9693
cat("Rows after removing outliers:", nrow(df_clean), "\n")
## Rows after removing outliers: 9563
cat("Removed:", nrow(df_flags) - nrow(df_clean), "rows\n")
## Removed: 130 rows
df <- df_clean
p1 <- ggplot(df_bf, aes(x = price_aprox_usd)) +
geom_histogram(bins = 60, fill = "#D0BBFC", color = "white") +
scale_x_log10(labels = scales::comma) +
labs(title = "Before Cleaning", x = "Price (USD, log10)", y = "Count") +
theme_minimal()
p2 <- ggplot(df, aes(x = price_usd)) +
geom_histogram(bins = 60, fill = "#9AD0EC", color = "white") +
scale_x_log10(labels = scales::comma) +
labs(title = "After Cleaning", x = "Price (USD, log10)", y = "Count") +
theme_minimal()
p1 + p2
Before cleaning: the distribution is wider and strongly skewed to the right, with extreme values above 5 million USD. These outliers inflate the tail of the distribution and make the market appear more dispersed than it really is.
After cleaning: the distribution becomes more compact and symmetric. Most properties are concentrated between 50,000 and 500,000 USD, with fewer extreme values.
## # A tibble: 16 × 6
## places n median_price_usd mean_price_usd median_ppm2 mean_ppm2
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 MiguelHidalgo 1311 526865. 664846. 2581. 2988.
## 2 Cuajimalpa 469 463641. 573380. 2222. 2293.
## 3 BenitoJuarez 2325 160852. 186200. 1581. 1575.
## 4 Cuauhtemoc 961 127005. 192282. 1133. 1541.
## 5 AlvaroObregon 1190 160047. 225132. 1100. 1366.
## 6 Tlalpan 592 152809. 210642. 967. 1132.
## 7 MagdalenaContrer… 191 177821. 230125. 905. 1130.
## 8 Coyoacan 767 115910. 154848. 887. 1056.
## 9 Xochimilco 150 141592. 151826. 770. 816.
## 10 Azcapotzalco 303 72391. 94553. 704. 736.
## 11 VenustianoCarran… 179 71864. 97306. 681. 773.
## 12 Iztacalco 178 76149. 98262. 657. 710.
## 13 GustavoAMadero 522 102751. 122195. 656. 794.
## 14 Tlahuac 59 86383. 102737. 652. 658.
## 15 Iztapalapa 365 83791. 98413. 618. 697.
## 16 MilpaAlta 1 39524. 39524. 395. 395.
The most expensive boroughs are Miguel Hidalgo (2,580 USD/m2) and Cuajimalpa (2,220 USD/m2), reflecting luxury markets in Polanco, Lomas, and Santa Fe.
A second tier includes Benito Juárez (1,580 USD/m2), Cuauhtémoc (1,130 USD/m2), and Álvaro Obregón (1,100 USD/m2), characterized by central locations and high demand.
Mid-range prices are observed in southern boroughs such as Tlalpan, Magdalena Contreras, and Coyoacán.
Finally, eastern boroughs like Iztapalapa, Tláhuac and Gustavo A. Madero show the lowest medians (600–660 USD/m2), pointing to more affordable markets.
High-value boroughs such as Miguel Hidalgo, Cuajimalpa, and Benito Juárez have the highest medians and wider spreads, reflecting the premium real estate markets in areas like Polanco, Santa Fe, and other neighborhoods.
Mid-range boroughs like Coyoacán, Magdalena Contreras, and Tlalpan display moderate prices with more variability, suggesting a mix of affordable and upscale areas.
Boroughs with lower values, such as Tláhuac, Iztapalapa, and Milpa Alta, display consistently lower medians and narrower distributions, indicating more affordable housing markets.
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
There are some points that are incorrectly located, so I will drop them.
# Apply the filter before dropping anything to check that the record does not match the locations.
df_filtered <- df %>% filter(surface_total_in_m2 == 110 & places == "AlvaroObregon" & price_usd_per_m2 >= 1504 & price_usd_per_m2 <= 1505)
df_filtered %>% head(19)
## property_type places price_usd surface_total_in_m2
## 1 house AlvaroObregon 165474.8 110
## surface_covered_in_m2 price_usd_per_m2 lat lon covered_ratio
## 1 110 1504.316 19.53541 -99.24323 1
nrow(df_filtered)
## [1] 1
df <- df %>%
filter(!(surface_total_in_m2 == 195 & places == "GustavoAMadero" & price_usd == 151737.07))
df <- df %>%
filter(!(surface_total_in_m2 == 300 & places == "AlvaroObregon" & price_usd >= 165435 & price_usd <= 165436))
df <- df %>%
filter(!(surface_total_in_m2 == 110 & places == "AlvaroObregon" & price_usd_per_m2 >= 1504 & price_usd_per_m2 <= 1505))
df <- df %>%
filter(!(surface_total_in_m2 == 364 & places == "Cuajimalpa" & price_usd_per_m2 >= 694))
df <- df %>%
filter(!(surface_total_in_m2 == 284 & places == "Cuajimalpa" & price_usd_per_m2 >= 408))
df <- df %>%
mutate(
places = ifelse(surface_total_in_m2 == 125 & places == "Cuauhtemoc" & price_usd_per_m2 >= 1970 & price_usd_per_m2 <=1972,
"Xochimilco", places))
library(broom)
model_df <- df %>%
mutate(log_price = log(price_usd),
log_total = log(surface_total_in_m2)) %>%
filter(is.finite(log_price), is.finite(log_total)) %>%
select(log_price, log_total, property_type, places, covered_ratio)
m1 <- lm(log_price ~ log_total + covered_ratio + property_type + places, data = model_df)
glance(m1) # R2, adj.R2, etc.
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.623 0.622 0.573 788. 0 20 -8226. 16497. 16654.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(m1) %>% arrange(p.value) %>% head(12)
## # A tibble: 12 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.65 0.141 40.1 0
## 2 log_total 1.23 0.0136 90.6 0
## 3 property_typehouse -0.409 0.0166 -24.6 6.25e-130
## 4 placesMiguelHidalgo 0.547 0.0240 22.8 5.43e-112
## 5 placesIztapalapa -0.459 0.0346 -13.3 6.78e- 40
## 6 placesGustavoAMadero -0.389 0.0304 -12.8 3.09e- 37
## 7 placesCuajimalpa 0.399 0.0324 12.3 1.53e- 34
## 8 placesAzcapotzalco -0.453 0.0370 -12.2 3.19e- 34
## 9 placesIztacalco -0.456 0.0461 -9.89 6.22e- 23
## 10 placesVenustianoCarranza -0.422 0.0460 -9.16 6.37e- 20
## 11 placesBenitoJuarez 0.166 0.0208 7.98 1.69e- 15
## 12 placesTlahuac -0.536 0.0765 -7.01 2.60e- 12